Abstract

Our group analyzed a multi-institutional data set to address the question of how the outcomes of surgery for pros- 
tate cancer are affected by surgeon-specific factors. The cohort consists of 9076 patients treated by open radical 
prostatectomy at one of four US academic institutions between 1987 and 2003. The primary analyses focused on 7765 patients
without neoadjuvant therapy. The best-known finding is that of a surgical "learning curve", with rates of prostate
cancer cure strongly dependent on surgeon experience. In this "data note", we provide the raw data set, as
well as well-annotated programming code for the main analyses. Data include markers of cancer severity (stage, 
grade and prostate-specific antigen level), cancer outcome, and surgeon variables such as training and experience. 



Introduction 

We have long been proponents of scientific data shar- 
ing, having published articles advocating sharing of raw 
data [1], guidelines for preparing data sets [2,3] and
empirical studies of authors' willingness to share data [4]. 

In some areas of science, such as microarray research, 
there are publicly available websites for depositing data 
(e.g. Gene Expression Omnibus or the Stanford Micro- 
array Database). For many scientific fields, however, the 
primary means to share data is publication of supple- 
mentary files on the journal website. Yet journals vary 
in their policies as to supplemental files and, as such, an 
author might wish to make available the raw data from 
a study, but have no obvious venue for data publication. 
Moreover, investigators often conduct multiple analyses 
on a data set, publishing several different papers. This is
a problem because analytic code should ideally be
published alongside raw data sets. Even if a
journal did agree to post supplemental raw data files to 
their website, it is unlikely that they would be sympa- 
thetic to publishing a comprehensive set of program- 
ming code encompassing analyses for papers previously 
published in other journals. 

BMC Research Notes provides an excellent venue for 
posting data sets from studies published elsewhere in 
the literature. This is not only because BMC has liberal 
policies as to supplemental files, but because Research 
Notes is very flexible as to the form of scientific articles.
This paper has been conceived as an example of the sort of
paper that might be published in Research Notes that
serves primarily as a placeholder for associated supplemental
files containing raw data and programming
code.

Research on the learning curve 

In the early 2000s, it became apparent that
the results of cancer surgery could vary between surgeons,
sometimes dramatically. Begg et al [5], for example,
published data showing that many more surgeons
than would be expected by chance had either very high
or very low rates of surgical complications after radical
prostatectomy. Colleagues of ours at Memorial Sloan- 
Kettering Cancer Center were interested in whether 
outcome variation might extend to cure rates. We colla- 
borated with colleagues at the Cleveland Clinic and 
Wayne State in order to develop a large data set of 
patients undergoing radical prostatectomy. The data set 
included information on each patient's baseline risk (stage
and grade of cancer, and level of prostate-specific antigen),
the surgeon who treated the patient, and the
patient's outcome (date of relapse or last follow-up). On 
the basis of this data set, we have published 6 separate 
studies: 

1. We demonstrated a "learning curve" for cancer 
control after radical prostatectomy. Patients treated 
by inexperienced surgeons were much more likely to 
recur than patients treated by more experienced
colleagues [6].

2. We found that the learning curve did not vary by 
either pre-operative risk[7] or pathologic stage[8]. 
Cure rates close to 100% in patients with organ-con- 
fined disease treated by the most experienced sur- 
geons were taken to indicate that recurrences in 
such patients are primarily a function of surgical 
technique. 

3. Recurrence rates vary between surgeons, even 
after adjusting for experience [9]. In other words, a
patient may have a different chance of cure depend- 
ing on which of two surgeons he sees, even if the 
two surgeons have conducted a similar number of 
previous radical prostatectomies. 

4. There is a learning curve for surgical margins, 
although the poor concordance between a surgeon's 
margin and recurrence rates suggests that the former 
is not a good surrogate for the latter[10]. 

5. The surgical learning curve differs depending on 
fellowship training. Surgeons without fellowship 
training initially have similar recurrence results to 
their fellowship-trained colleagues but then fail to 
improve with experience. Conversely, the learning 
curve for surgical margin status did not differ by fel- 
lowship training. This suggests that there are differ- 
ent mechanisms of surgical learning for surgical 
margins and recurrence [11].

As our series of studies came to an end, our collabora- 
tors agreed to make the data set freely available for 
other investigators to use. 

Statement as to patient consent 

Consent for publication of raw data was not obtained 
from participants. All data were obtained as part of rou- 
tine clinical practice and were downloaded for research 
purposes under IRB waivers for retrospective data. 
Accordingly, consent for data publication was not 
obtained from patients before data were initially 
received. Obtaining consent retrospectively would be 
infeasible as there were over 9000 patients, some treated 
more than 20 years ago, and many of whom have died. 
Nonetheless, the dataset is fully anonymous in a manner 
that can easily be verified by any user of the dataset. 
Patients and surgeons are identified only by an anon- 
ymous code; there are no identifying data such as name, 
address or social security numbers; patient age is subject 
to random jitter; the age of patients who were unusually 
old or young at the time of surgery is modified to ceil- 
ing and floor values. As such, publication of the dataset 
clearly and obviously presents minimal risk to confiden- 
tiality of study participants. 
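
As an illustration only, the deidentification steps listed above (anonymous codes, jittered ages, floor and ceiling values) might look like the following Python sketch. It is not the authors' Stata code: the field names, the jitter width of one year and the age limits of 45 and 75 are all hypothetical, chosen purely for the example.

```python
import random


def deidentify(records, age_floor=45, age_ceiling=75, jitter=1, seed=0):
    """Return new records with anonymous codes and perturbed, clamped ages.

    Every field name and threshold here is a hypothetical choice for
    illustration; none is taken from the published data set.
    """
    rng = random.Random(seed)
    patient_codes = {}
    surgeon_codes = {}
    deidentified = []
    for rec in records:
        # Map each true identifier to a sequential anonymous code.
        # (A real pipeline might shuffle first so codes carry no ordering.)
        patient = patient_codes.setdefault(rec["patient_id"], len(patient_codes) + 1)
        surgeon = surgeon_codes.setdefault(rec["surgeon_id"], len(surgeon_codes) + 1)
        # Add random jitter to age, then clamp unusually low or high values.
        age = rec["age"] + rng.randint(-jitter, jitter)
        age = max(age_floor, min(age_ceiling, age))
        # Copy only non-identifying fields; names, dates etc. are dropped.
        deidentified.append(
            {"patient": patient, "surgeon": surgeon, "age": age, "psa": rec["psa"]}
        )
    return deidentified
```

Because only a fixed set of fields is copied into the output, a user can check the result for identifying information simply by listing its field names.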



Supplemental files: data and statistical code 

Given below is a list of files that we have uploaded to 
BMC Research Notes in order to make our data avail- 
able. Following several prior recommendations [12], we 
have also uploaded statistical code to allow replication 
of our results. The code is saved as Stata "do" files, but 
these can be opened from within a text editor or word 
processing package such as Microsoft Word. The code 
has been well-annotated, we hope sufficiently so to 
allow non-Stata users to follow our logic. We created 
over 100 do files for the numerous papers associated 
with our learning curve studies. Publishing all of these 
do files would more likely lead to confusion than 
insight. As such, we selected a sub-set of representative 
analyses that we believe would allow any competent 
analyst to replicate our results. For example, we provide 
code for a sensitivity analysis that includes only sur- 
geons whose career experience was at least 100 cases; 
this code is easily adapted for a sensitivity analysis that 
includes surgeons with career experience of 250 or 
more cases. 
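
The adaptation described above amounts to changing a single threshold. A minimal Python sketch (not the authors' Stata code) follows; it assumes, hypothetically, that each case is a record with a `surgeon` field and that career experience is counted from the cases in the data set itself.

```python
from collections import Counter


def restrict_by_career_volume(cases, min_total_cases=100):
    """Keep only cases treated by surgeons with enough total cases.

    `cases` is a list of dicts with a hypothetical "surgeon" key; a
    surgeon's career volume is taken to be his or her case count in `cases`.
    """
    totals = Counter(case["surgeon"] for case in cases)
    return [case for case in cases if totals[case["surgeon"]] >= min_total_cases]
```

Calling `restrict_by_career_volume(cases, 250)` instead of `restrict_by_career_volume(cases, 100)` gives the stricter cohort; the downstream analysis is unchanged.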

Moreover, some of the code was originally written in a 
highly modular fashion, with kernels of code referenced 
by numerous different do files, with extensive routines 
for printing out results in a readable form (e.g. rounding 
p values). Both features can make our programming dif- 
ficult to follow. Accordingly, we simplified the code for 
this presentation, removing code associated with presen- 
tation, and duplicating code in different do files in some 
cases. We also wrote new code to deidentify the data set. 

We estimate that the total time taken to prepare the 
data and code for publication was 8 hours. While far 
from trivial, this constitutes a small fraction of the effort 
spent on the data set over the past five years. Moreover, 
this estimate must be seen as higher than typical, given 
that the code involved covered so many different papers. 

The data have been uploaded both in Stata format, 
and a raw format that can be read by most software (it 
can be opened directly in Microsoft Excel, for example). 
These two files are named "master learning curve data 
set deidentified" with ".dta" and ".raw" extensions, respectively.
"Variable labels.pdf" describes each variable in
the data set [see Additional files 1, 2 and 3]. A description
of each do file is as follows:

1. 01 deidentify data learning curve.do [Additional 
file 4] 

This do file takes the data set with identifying 
information and saves out a new data set without 
any identifying information. This includes 
removing patient and surgeon identifiers and 
replacing them with anonymous identifiers, 
removing dates, and ensuring that patient age 
cannot identify individuals. Before saving out the
deidentified data set, a data set is saved with 
both the true and anonymous patient and sur- 
geon identifiers; this data set is not published, 
but is kept with the primary investigators so that 
any data enquiries about individual patients can 
be addressed by the primary investigator. 

2. 02 primary analysis bcr learning curve.do
[Additional file 5]

This do file performs the primary analysis of the 
learning curve for biochemical recurrence[6]. 
This is an example of the code to produce a 
learning curve for a survival-time outcome. A 
multivariable analysis is performed to obtain the 
adjusted p-value for the association between 
surgeon experience and outcome; the adjusted 
5-year predicted probability of freedom from bio- 
chemical recurrence is plotted against surgeon 
experience; and the central estimates for 10 and 
250 prior cases are displayed. 

3. 03 bootstrap ci for difference in 10 vs 250 bcr 
learning curve.do [Additional file 6] 

This do file uses bootstrap resampling to con- 
struct a 95% confidence interval for the differ- 
ence in adjusted 5-year probability of 
biochemical recurrence for a patient treated by a 
surgeon with 10 vs 250 prior cases[6]. The out- 
put from the bootstrap resampling is saved as a 
Stata data set "output bootstrap ci for difference 
10 vs 250 learning curve.dta". This is an example
of code where bootstrap resampling is used to 
obtain confidence intervals for an estimate whose 
sampling distribution is unknown. The code 
could be modified easily for another estimate of 
interest, for example, the difference in adjusted 
probability of positive surgical margins for a 
patient treated by a surgeon with 10 vs 250 prior 
cases. 

4. 04 sensitivity analysis patients treated after 
1995 bcr learning curve.do [Additional file 7] 

This do file performs the same analysis as done 
in "02 primary analysis bcr learning curve.do", 
except that the cohort is restricted to patients 
treated after 1995, when stage migration related 
to the advent of PSA screening appeared to be 
largely complete[6]. This is an example of the 
code where a specific group of patients is 
included, and another group excluded. This code 
could be modified easily to restrict the analysis 
to a different subgroup, for example, patients 
with low risk disease. 

5. 05 sensitivity analysis surgeons with at least 
100 total cases bcr learning curve.do [Additional 
file 8] 



This do file performs the same analysis as done 
in "02 primary analysis bcr learning curve.do", 
except that the cohort is restricted to surgeons 
who completed at least 100 total cases. This 
sensitivity analysis was performed to confirm 
that the relationship between surgeon
experience and outcome was not confounded 
by the ability of individual surgeons to attract 
patients (i.e., a less capable surgeon who was 
unable to establish a practice would therefore 
contribute to the beginning but not the end of 
the learning curve) [6]. This is an example of 
the code where only patients treated by a speci- 
fic group of surgeons are included. This code 
could be modified easily to restrict the analysis 
to patients treated by a different group of sur- 
geons, for example, surgeons who completed at 
least 250 total cases. 

6. 06 separately by postoperative risk bcr learning 
curve.do [Additional file 9] 

This do file performs the primary analysis of the 
learning curve for biochemical recurrence sepa- 
rately by pathologic stage [8]. This is an example 
of the code to produce a learning curve sepa- 
rately for different subgroups of patients. A mul- 
tivariable analysis is performed to obtain the 
adjusted p-value for the association between sur- 
geon experience and outcome separately for 
those with organ-confined and non-organ-con- 
fined disease; the adjusted 5-year predicted prob- 
ability of freedom from biochemical recurrence 
is plotted against surgeon experience separately 
by pathologic stage; and the central estimates for 
10 and 250 prior cases are displayed. This code 
could be modified to obtain separate learning 
curves for subgroups defined in other ways, for 
example, patients treated by fellowship vs. non- 
fellowship trained surgeons. 

7. 07 surgical margins learning curve.do [Addi- 
tional file 10] 

This do file performs the primary analysis of the 
learning curve for surgical margins[10]. This is 
an example of the code to produce a learning 
curve for a binary outcome. A multivariable ana- 
lysis is performed to obtain the adjusted p-value 
for the association between surgeon experience 
and outcome; the adjusted predicted probability 
of positive surgical margin is plotted against sur- 
geon experience; and the central estimates for 10 
and 250 prior cases are displayed. This code 
could be modified easily to restrict the analysis 
to a particular subgroup of patients. 

8. 08 heterogeneity in bcr by surgeon.do [Addi- 
tional file 11] 
This do file fits a multivariable random-effects
model to evaluate heterogeneity between
surgeons in biochemical recurrence outcomes 
after adjustment for case-mix and surgeon 
experience. The random effects variance, 95% 
confidence interval, and p-value are displayed [9]. 
This is an example of the code to determine 
whether heterogeneity exists between surgeons, 
and could be modified easily for different types 
of outcomes (for example, a binary outcome 
such as positive surgical margins) or different 
subgroups of patients. 

9. 09 forest plot bcr by surgeon.do [Additional
file 12]

This do file obtains the adjusted 5-year predicted 
probability of freedom from biochemical recur- 
rence for each surgeon; obtains a combined esti- 
mate across all surgeons using meta-analytic 
methods, and shows the probabilities and 95% 
confidence intervals for each surgeon as a forest 
plot[9]. This could be modified easily for differ- 
ent types of outcomes or different subgroups of 
patients. 
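
To illustrate the bootstrap approach used in file 03, here is a simplified Python sketch, not the authors' Stata code. It swaps the adjusted 5-year probabilities from the survival model for plain unadjusted proportions in two hypothetical groups of patients, and it uses the percentile method, which is an assumption; the do file should be consulted for the actual procedure.

```python
import random


def bootstrap_ci_difference(group_a, group_b, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the difference in event proportions.

    group_a and group_b are lists of 0/1 outcomes (1 = recurrence), e.g.
    hypothetical patients of less and more experienced surgeons.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)

    def proportion(outcomes):
        return sum(outcomes) / len(outcomes)

    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(proportion(resample_a) - proportion(resample_b))
    diffs.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled differences.
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return proportion(group_a) - proportion(group_b), lower, upper
```

The same resampling loop applies to any other estimate: only the `proportion` step would change, for instance to a model-based adjusted probability.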

Conclusions 

Publishing scientific papers on the web provides far 
greater flexibility of form and function than is possible 
with traditional publication in a paper journal. The 
medium allows the development of new kinds of scienti- 
fic paper, such as a "Data Note", including data and sta- 
tistical code for scientific projects involving several 
different research questions and multiple papers. This 
paper aims to provide an example of the form. 

Additional material 



Additional file 7: sensitivity analysis patients treated after 1995 bcr 
learning curve.do. This is a Stata "do" file - statistical programming code - 
that performs the same analysis as done in "02 primary analysis bcr 
learning curve.do", except that the cohort is restricted to patients treated 
after 1995.

Additional file 8: sensitivity analysis surgeons with at least 100 
total cases bcr learning curve.do. This is a Stata "do" file - statistical 
programming code - that performs the same analysis as done in "02 
primary analysis bcr learning curve.do", except that the cohort is 
restricted to surgeons who completed at least 100 total cases. 

Additional file 9: separately by postoperative risk bcr learning 
curve.do. This is a Stata "do" file - statistical programming code - that 
performs the primary analysis of the learning curve for biochemical 
recurrence separately by pathologic stage[8]. 

Additional file 10: surgical margins learning curve.do. This is a Stata 
"do" file - statistical programming code - that performs the primary 
analysis of the learning curve for surgical margins[10]. 

Additional file 11: heterogeneity in bcr by surgeon.do. This is a Stata
"do" file - statistical programming code - that fits a multivariable
random-effects model to evaluate heterogeneity between surgeons in 
biochemical recurrence outcomes after adjustment for case-mix and 
surgeon experience [9].

Additional file 12: forest plot bcr by surgeon.do. This is a Stata "do"
file - statistical programming code - that obtains the adjusted 5-year 
predicted probability of freedom from biochemical recurrence for each 
surgeon; obtains a combined estimate across all surgeons using meta- 
analytic methods, and shows the probabilities and 95% confidence 
intervals for each surgeon as a forest plot [9].